Prevent infinite waiting on remote DC in multi-region mode during failover #12071 #13295

Open
MarkSh1 wants to merge 2 commits into apple:main from MarkSh1:fix-dd-hangon-when-multerigional-failover

Conversation

@MarkSh1
Contributor

@MarkSh1 MarkSh1 commented May 28, 2026

In a degraded multi-region configuration with the primary DC disabled, recovery can reach ACCEPTING_COMMITS but not ALL_LOGS_RECRUITED.
In this case, DD should continue initializing instead of waiting indefinitely.
This PR replaces the wait criterion for starting the DD remote path:

  • wait for ACCEPTING_COMMITS instead of ALL_LOGS_RECRUITED
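
A minimal sketch of the criterion change, written as a simplified Python model rather than the actual Flow code in DataDistribution.cpp. The state names ACCEPTING_COMMITS and ALL_LOGS_RECRUITED come from the description above; the `RecoveryState` enum values and the `remote_path_can_start` helper are hypothetical and only encode the assumption that ACCEPTING_COMMITS is reached before ALL_LOGS_RECRUITED.

```python
from enum import IntEnum


class RecoveryState(IntEnum):
    # Illustrative values only (assumption): what matters is the ordering
    # ACCEPTING_COMMITS < ALL_LOGS_RECRUITED.
    ACCEPTING_COMMITS = 1
    ALL_LOGS_RECRUITED = 2


def remote_path_can_start(state, threshold):
    """Return True once recovery has reached the given threshold state."""
    return state >= threshold


# Degraded multi-region case from the description: recovery stops at
# ACCEPTING_COMMITS and does not reach ALL_LOGS_RECRUITED.
stuck_at = RecoveryState.ACCEPTING_COMMITS

old_criterion = remote_path_can_start(stuck_at, RecoveryState.ALL_LOGS_RECRUITED)
new_criterion = remote_path_can_start(stuck_at, RecoveryState.ACCEPTING_COMMITS)

print(old_criterion)  # False: DD keeps waiting
print(new_criterion)  # True: DD continues initializing
```

Under this model the old threshold can never be satisfied in the degraded case, which matches the "(Re)initializing automatic data distribution" state reported in this PR.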

Before the change, the cluster behaved as follows. Region configuration:

{
  "regions": [
    {
      "datacenters": [
        {"id": "dc1", "priority": 2},
        {"id": "dc11", "satellite": 1}
      ],
      "satellite_redundancy_mode": "one_satellite_single",
      "satellite_logs": 1
    },
    {
      "datacenters": [
        {"id": "dc2", "priority": 1},
        {"id": "dc21", "satellite": 1}
      ],
      "satellite_redundancy_mode": "one_satellite_single",
      "satellite_logs": 1
    }
  ]
}
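
To make the roles in this region configuration explicit, here is a rough Python sketch that picks the preferred primary datacenter. It assumes that the non-satellite datacenter with the highest `priority` is preferred as primary, and it ignores everything else the cluster controller takes into account. The `preferred_primary` helper and its `down` parameter are hypothetical names used only for illustration.

```python
import json

REGIONS_JSON = """
{
  "regions": [
    {
      "datacenters": [
        {"id": "dc1", "priority": 2},
        {"id": "dc11", "satellite": 1}
      ],
      "satellite_redundancy_mode": "one_satellite_single",
      "satellite_logs": 1
    },
    {
      "datacenters": [
        {"id": "dc2", "priority": 1},
        {"id": "dc21", "satellite": 1}
      ],
      "satellite_redundancy_mode": "one_satellite_single",
      "satellite_logs": 1
    }
  ]
}
"""


def preferred_primary(config, down=frozenset()):
    """Return the id of the highest-priority non-satellite DC that is up."""
    candidates = [
        dc
        for region in config["regions"]
        for dc in region["datacenters"]
        if not dc.get("satellite") and dc["id"] not in down
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda dc: dc["priority"])["id"]


config = json.loads(REGIONS_JSON)
print(preferred_primary(config))                # dc1
print(preferred_primary(config, down={"dc1"}))  # dc2
```

With dc1 down, the sketch selects dc2, which agrees with the status output in this PR listing dc2 as Primary and dc1 as Remote.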

/etc/foundationdb/foundationdb.conf

...
[fdbserver.4500]
 datacenter-id = dc1
 machine-id = m0
[fdbserver.4503]
 datacenter-id = dc1
 machine-id = m3
[fdbserver.4501]
 datacenter-id = dc11
 machine-id = m1

[fdbserver.4502]
 datacenter-id = dc2
 machine-id = m2
[fdbserver.4504]
 datacenter-id = dc2
 machine-id = m4
[fdbserver.4505]
 datacenter-id = dc21
 machine-id = m5
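
The process layout above can be summarized per datacenter with a small sketch. The parser below is a hypothetical helper that handles only the `[fdbserver.<port>]` sections and `datacenter-id` keys shown here; it is not a general foundationdb.conf parser.

```python
CONF = """\
[fdbserver.4500]
 datacenter-id = dc1
 machine-id = m0
[fdbserver.4503]
 datacenter-id = dc1
 machine-id = m3
[fdbserver.4501]
 datacenter-id = dc11
 machine-id = m1

[fdbserver.4502]
 datacenter-id = dc2
 machine-id = m2
[fdbserver.4504]
 datacenter-id = dc2
 machine-id = m4
[fdbserver.4505]
 datacenter-id = dc21
 machine-id = m5
"""


def processes_by_dc(conf_text):
    """Map each datacenter id to the list of fdbserver ports assigned to it."""
    result = {}
    port = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("[fdbserver.") and line.endswith("]"):
            port = int(line[len("[fdbserver."):-1])
        elif port is not None and line.startswith("datacenter-id"):
            dc = line.split("=", 1)[1].strip()
            result.setdefault(dc, []).append(port)
    return result


by_dc = processes_by_dc(CONF)
# Ports that remain once every process in dc1 is stopped.
surviving = sorted(p for dc, ports in by_dc.items() if dc != "dc1" for p in ports)
print(by_dc)
print(surviving)
```

Stopping the two dc1 processes leaves four processes across dc11, dc2, and dc21, which is consistent with the process count reported by the status output in this PR.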

After dc1 (processes 4500 and 4503) is turned off, the cluster always remains in the following state:

Replication health - (Re)initializing automatic data distribution
...

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.101.97.25:4500  (unreachable)
  10.101.97.25:4501  (reachable)
  10.101.97.25:4502  (reachable)

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Log engine             - ssd-2
  Coordinators           - 3
  Usable Regions         - 2
  Regions:
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc11
        Satellite Redundancy Mode     - one_satellite_single
        Satellite Logs                - 1
    Primary -
        Datacenter                    - dc2
        Satellite datacenters         - dc21
        Satellite Redundancy Mode     - one_satellite_single
        Satellite Logs                - 1

Cluster:
  FoundationDB processes - 4
  Zones                  - 4
  Machines               - 4
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 272 begin: 133553941258 end: 133717782199, missing log interfaces(id,address): 058007dc5531bf0c, ac5126871aa7a72b,

  Server time            - 05/28/26 11:38:44

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  System keyspace sizes  - unknown
  Disk space used        - 3.291 GB

Operating space:
  Storage server         - 153.6 GB free on most full server
  Log server             - 153.6 GB free on most full server

Workload:
  Read rate              - 13 Hz
  Write rate             - 0 Hz
  Transactions started   - 6 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.101.97.25:4501      (  2% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.101.97.25:4502      (  0% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.101.97.25:4504      (  2% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.101.97.25:4505      (  1% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )

Client time: 05/28/26 11:38:41

After the change, the status is:


Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.101.97.25:4500  (unreachable)
  10.101.97.25:4501  (reachable)
  10.101.97.25:4502  (reachable)

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Log engine             - ssd-2
  Coordinators           - 3
  Usable Regions         - 2
  Regions:
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc11
        Satellite Redundancy Mode     - one_satellite_single
        Satellite Logs                - 1
    Primary -
        Datacenter                    - dc2
        Satellite datacenters         - dc21
        Satellite Redundancy Mode     - one_satellite_single
        Satellite Logs                - 1

Cluster:
  FoundationDB processes - 4
  Zones                  - 4
  Machines               - 4
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 272 begin: 133553941258 end: 133717782199, missing log interfaces(id,address): 058007dc5531bf0c, ac5126871aa7a72b,

  Server time            - 05/28/26 11:57:53

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 1.306 GB
  System keyspace sizes  - 0 MB
  Disk space used        - 3.291 GB

Operating space:
  Storage server         - 153.6 GB free on most full server
  Log server             - 153.6 GB free on most full server

Workload:
  Read rate              - 16 Hz
  Write rate             - 0 Hz
  Transactions started   - 9 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.101.97.25:4501      (  2% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.101.97.25:4502      (  0% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.101.97.25:4504      (  1% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.101.97.25:4505      (  1% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 0.1 GB / 8.0 GB RAM  )

Client time: 05/28/26 11:57:50
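
The two status outputs above differ mainly in the Data section. As a rough illustration, the following sketch extracts the "Replication health" line from status text in the layout shown here. The `replication_health` helper is hypothetical, and the sample strings are excerpts copied from the outputs in this PR.

```python
import re


def replication_health(status_text):
    """Return the value of the 'Replication health' line, or None if absent."""
    match = re.search(
        r"^[ \t]*Replication health[ \t]+-[ \t]+(.*\S)[ \t]*$",
        status_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


BEFORE = """Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
"""

AFTER = """Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
"""

print(replication_health(BEFORE))  # (Re)initializing automatic data distribution
print(replication_health(AFTER))   # Healthy
```

A check like this could be polled after a failover to see whether DD ever leaves the initializing state.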

…re are no tlog processes in it but usableRegions=2
@gxglass
Collaborator

gxglass commented May 28, 2026

It's a little hard for me to think through all the implications of this. Let's start with running CIs.

@gxglass gxglass closed this May 28, 2026
@gxglass gxglass reopened this May 28, 2026
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 4a22692
  • Duration 0:03:46
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 4a22692
  • Duration 0:04:22
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 4a22692
  • Duration 0:04:23
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 4a22692
  • Duration 0:04:24
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 4a22692
  • Duration 0:04:28
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 4a22692
  • Duration 0:29:59
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 4a22692
  • Duration 0:51:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@MarkSh1
Contributor Author

MarkSh1 commented May 29, 2026

@gxglass, could you please restart CI? I've applied clang-format to DataDistribution.cpp.

@gxglass gxglass closed this May 29, 2026
@gxglass gxglass reopened this May 29, 2026
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: b363e11
  • Duration 0:22:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: b363e11
  • Duration 0:34:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: b363e11
  • Duration 0:46:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: b363e11
  • Duration 0:46:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: b363e11
  • Duration 0:48:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: b363e11
  • Duration 0:50:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: b363e11
  • Duration 1:06:26
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@gxglass
Collaborator

gxglass commented May 29, 2026

Here's what the local AI said about this. I don't know how much weight to give to this line of thinking, but here is a concern: this code change could have easily been made in the prior decade, so I'm wondering if there are other reasons not to do it.

pr13295-review.md
